To illustrate these models further, let us consider a simple AR(1) process and carry out the following iterative exercise

$$\begin{aligned} y_t &= \phi_1 y_{t-1} + w_t \\ &= \phi_1(\phi_1 y_{t-2} + w_{t-1}) + w_t \\ &= \phi_1^2 y_{t-2} + \phi_1 w_{t-1} + w_t \\ &= \cdots \\ &= \phi_1^h y_{t-h} + \sum_{i=0}^{h-1}\phi_1^i w_{t-i}. \end{aligned}$$

If we suppose that $|\phi_1| < 1$, then continuing the iteration to infinity would lead us to $$y_t = \sum_{i=0}^{\infty}\phi_1^i w_{t-i},$$ which has $\mathbb{E}[y_t] = 0$ and autocovariance function $$\gamma(h) = \frac{\phi_1^h}{1-\phi_1^2}\sigma_w^2,$$ where $\sigma_w^2$ is the variance of the white noise process $(w_t)$. It must be noted that if $\phi_1 = 1$ then the AR(1) becomes a random walk process (non-stationary), whereas if $|\phi_1| < 1$ the process is stationary.

For these kinds of models, the ACF and PACF can give meaningful insight into the order of the autoregressive model. The following plots show the ACF and PACF of an AR(1) with $\phi_1 = 0.8$, indicating that the ACF of an AR process decays exponentially while its PACF cuts off after the lag corresponding to the order of the autoregressive process.

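The plots can be reproduced with a short simulation; here is a minimal sketch in base R (the seed, sample size and lag horizon are arbitrary choices):

```r
# Simulate an AR(1) with phi_1 = 0.8 and plot its ACF and PACF
set.seed(123)                                   # arbitrary seed for reproducibility
xt <- arima.sim(model = list(ar = 0.8), n = 500)

par(mfrow = c(1, 2))
acf(xt,  lag.max = 20, main = "ACF of simulated AR(1)")    # exponential decay
pacf(xt, lag.max = 20, main = "PACF of simulated AR(1)")   # cuts off after lag 1
par(mfrow = c(1, 1))
```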

The conditions for the stationarity of an AR(1) model have been partially discussed above. However, the condition $|\phi_1| < 1$ is not the only one that guarantees that an AR(1) is stationary. Indeed, so-called explosive AR(1) models with $|\phi_1| > 1$ can be re-expressed, using the same kind of iterative argument as above, as stationary processes. Nevertheless, this iterative procedure makes use of future values of the series $(y_t)$, and the resulting process is therefore not a so-called "causal" process. Generally speaking, by causal process we mean a process whose current values are explained only by its past values and not by its future ones.

Causality is discussed in more depth, including for AR(p) models, in the section on causal and invertible ARMA models below.

Moving Average Models (MA(q))

A moving average model can be interpreted in a similar way to an AR(p) model, except that in this case the time series is the result of a linear operation on the innovation process rather than on the time series itself. More specifically, an MA(q) model can be defined as follows $${y_t} = \theta_1 w_{t-1} + ... + \theta_q w_{t-q} + w_t$$. Following the same logic as for the AR(p) models and using the backshift operator as before, we can re-express these moving average processes as follows $$y_t = \theta(B)w_t$$ where $\theta(B) = 1 + \theta_1B + ... + \theta_qB^q$.

These processes are always stationary, no matter what values the $\theta_j$ take. However, MA(q) processes may not be identifiable through their autocovariance functions. By the latter we mean that different parameters for an MA(q) model of the same order can deliver the exact same autocovariance function, and it would therefore be impossible to retrieve the parameters of the model by only looking at the autocovariance function.

Invertibility, which resolves this identifiability issue, is discussed in the section on causal and invertible ARMA models below.

AutoRegressive Moving Average Models (ARMA(p,q))

The objective of this chapter is to discuss AutoRegressive Moving Average (ARMA) models.

ARMA Definition

MA / AR Operators

We define the AR and MA polynomials, which will be used often throughout this chapter.

Define the AR and MA polynomials respectively as \begin{equation} \phi(z) = 1 - \phi_1 z - ... - \phi_p z^p, \mbox{ } \phi_p \neq 0, \end{equation} and \begin{equation} \theta(z) = 1 + \theta_1 z + ... + \theta_q z^q, \mbox{ } \theta_q \neq 0, \end{equation} where $z$ is a complex number.

Redundancy

Parameter redundancy is a problem in which an ARMA(p, q) model has more parameters than necessary. In particular, when we want to estimate the parameters of a proposed ARMA(p, q) model, if the model has parameter redundancy then it is not identifiable (i.e. its parameters are not unique). In addition, causality and invertibility can only be assessed for ARMA(p, q) models without parameter redundancy.

Parameter redundancy can be checked by using the AR and MA polynomials. A sufficient condition for parameter redundancy is that the AR and MA polynomials have a common factor.

Example: Redundancy Consider the following model

$$\begin{equation} X_t = 0.4X_{t-1} + 0.45X_{t-2} + W_t + W_{t-1} + 0.25W_{t-2}, \end{equation}$$ rewrite it with backshift operator,

$$\begin{equation} (1 - 0.4B - 0.45B^2)X_t = (1 + B + 0.25B^2)W_t, \end{equation}$$

it looks like an ARMA(2, 2), but since

$$\begin{equation} (1 + 0.5B)(1 - 0.9B)X_t = (1 + 0.5B)^2W_t, \end{equation}$$

the AR polynomial

$$\begin{equation} \phi(z) = (1 + 0.5z)(1 - 0.9z), \end{equation}$$

and the MA polynomial

$$\begin{equation} \theta(z) = (1 + 0.5z)^2, \end{equation}$$ have a common factor, so there is an issue of redundancy. By removing the common factor, we can obtain an actual ARMA(1, 1) model,

$$\begin{equation} (1 - 0.9B)X_t = (1 + 0.5B)W_t. \end{equation}$$
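One can check for common factors numerically by comparing the roots of the two polynomials; a small sketch using R's polyroot() (coefficients are passed in increasing order of power):

```r
# Roots of the AR polynomial 1 - 0.4z - 0.45z^2 and the MA polynomial 1 + z + 0.25z^2
ar_roots <- polyroot(c(1, -0.4, -0.45))   # expect -2 and 10/9
ma_roots <- polyroot(c(1,  1,    0.25))   # expect -2 (a double root)
ar_roots
ma_roots
# The shared root z = -2 corresponds to the common factor (1 + 0.5z),
# signalling parameter redundancy in the ARMA(2, 2) specification.
```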

Causal + Invertible

Definitions

(Def. 3.7, p.94). An ARMA(p, q) model is causal, if the time series $\{ X_t \}_{-\infty}^{\infty}$ can be written as a one-sided linear process: \begin{equation} X_t = \sum_{j = 0}^{\infty} \psi_j W_{t-j} = \psi(B) W_t, (#eq:causal) \end{equation} where $\psi(B) = \sum_{j = 0}^{\infty} \psi_j B^j$, and $\sum_{j=0}^{\infty}|\psi_j| < \infty$; we set $\psi_0 = 1$.

(Def. 3.8, p.95). An ARMA(p, q) model is invertible, if the time series $\{ X_t \}_{-\infty}^{\infty}$ can be written as \begin{equation} \pi(B)X_t = \sum_{j = 0}^{\infty} \pi_j X_{t-j} = W_t, (#eq:invertible) \end{equation} where $\pi(B) = \sum_{j = 0}^{\infty} \pi_j B^j$, and $\sum_{j=0}^{\infty}|\pi_j| < \infty$; we set $\pi_0 = 1$.

It can be difficult to show the causality or invertibility of an ARMA process (especially a higher-order one) by using these definitions directly, thus the following properties are useful in practice.

Property: Causality If an ARMA(p, q) model is causal, then the coefficients of the one-sided linear process given in (\@ref(eq:causal)) can be obtained by \begin{equation} \psi(z) = \sum_{j=0}^{\infty} \psi_j z^j = \frac{\theta(z)}{\phi(z)}, \mbox{ } |z| \leq 1. \end{equation}

Thus, we have the following result for causality: an ARMA(p, q) model (with no parameter redundancy) is causal if and only if $\phi(z) \neq 0$ for $|z| \leq 1$, i.e. if and only if all the roots of the AR polynomial $\phi(z)$ lie outside the unit circle.

Property: Invertibility If an ARMA(p, q) model is invertible, then the coefficients of the one-sided linear process given in (\@ref(eq:invertible)) can be obtained by \begin{equation} \pi(z) = \sum_{j=0}^{\infty} \pi_j z^j = \frac{\phi(z)}{\theta(z)}, \mbox{ } |z| \leq 1. \end{equation}

Thus, we have the following result for invertibility: an ARMA(p, q) model (with no parameter redundancy) is invertible if and only if $\theta(z) \neq 0$ for $|z| \leq 1$, i.e. if and only if all the roots of the MA polynomial $\theta(z)$ lie outside the unit circle.

Example: Causal Conditions for an AR(2) Process We already know that an AR(1) is causal under the simple condition $|\phi_1| < 1$. It could seem natural to believe that an AR(2) should be causal (and hence stationary) under the condition $|\phi_i| < 1, \, i = 1,2$; however, this is not the case. Indeed, an AR(2) can be expressed as

$$X_t = \phi_1 X_{t-1} + \phi_2 X_{t-2} + W_t = \phi_1 B X_t + \phi_2 B^2 X_t + W_t,$$

corresponding to the following autoregressive operator:

$$\phi(z) = 1 - \phi_1 z - \phi_2 z^2.$$

Therefore, the process is causal when the roots of $\phi(z)$ lie outside of the unit circle. Letting $z_1$ and $z_2$ denote those roots, we impose the following constraints to ensure the causality of the model:

$$\begin{aligned} |z_1| &> 1, \quad \text{where} \;\; z_1 = \frac{\phi_1 + \sqrt{\phi_1^2 + 4\phi_2}}{-2 \phi_2}, \\ |z_2| &> 1, \quad \text{where} \;\; z_2 = \frac{\phi_1 - \sqrt{\phi_1^2 + 4\phi_2}}{-2 \phi_2}, \end{aligned}$$ note that $z_1$ and $z_2$ can be complex valued.

Thus we can represent $\phi_1$ and $\phi_2$ by $z_1$ and $z_2$, $$\begin{aligned} \phi_1 &= (z_1^{-1} + z_2^{-1}), \\ \phi_2 &= -(z_1 z_2)^{-1}. \end{aligned}$$

Moreover we have the following equivalent condition for causality:

$$\begin{cases} |z_1| > 1 \\ |z_2| > 1 \end{cases}$$ if and only if $$\begin{cases} \phi_1 + \phi_2 < 1 \\ \phi_2 - \phi_1 < 1 \\ |\phi_2| < 1 \end{cases}$$

We first show that $|z_1| > 1$ and $|z_2| > 1$ imply the three conditions on $\phi_1$ and $\phi_2$ (for simplicity, the displayed steps take $z_1$ and $z_2$ real; the complex conjugate case is analogous, working with moduli).

$\phi_1 + \phi_2 = \frac{1}{z_1} + \frac{1}{z_2} - \frac{1}{z_1z_2} = \frac{1}{z_1}\left( 1 - \frac{1}{z_2} \right) + \frac{1}{z_2} < 1 - \frac{1}{z_2} + \frac{1}{z_2} = 1$, (since $\left( 1 - \frac{1}{z_2} \right) > 0$)

$\phi_2 - \phi_1 = - \frac{1}{z_1z_2} - \frac{1}{z_1} - \frac{1}{z_2} = - \frac{1}{z_1}\left( \frac{1}{z_2} + 1 \right) - \frac{1}{z_2} < \frac{1}{z_2} + 1 - \frac{1}{z_2} = 1$, (since $\left( \frac{1}{z_2} + 1 \right) > 0$)

$|\phi_2| = \frac{1}{|z_1||z_2|} < 1$

We can also show the converse, namely that the three conditions on $\phi_1$ and $\phi_2$ imply $|z_1| > 1$ and $|z_2| > 1$. It is convenient to work with the reciprocal roots $u_1 = z_1^{-1}$ and $u_2 = z_2^{-1}$, which are the solutions of $u^2 - \phi_1 u - \phi_2 = 0$, so that $\phi_1 = u_1 + u_2$ and $\phi_2 = -u_1 u_2$. The three conditions can then be rewritten as

$$(1 - u_1)(1 - u_2) > 0, \quad (1 + u_1)(1 + u_2) > 0, \quad |u_1 u_2| < 1.$$

If $u_1$ and $u_2$ are complex conjugates, then $|u_1|^2 = u_1 u_2 = -\phi_2 < 1$, so $|u_1| = |u_2| < 1$. If they are real and, say, $u_1 \geq 1$, then $(1 - u_1)(1 - u_2) > 0$ forces $u_2 > 1$ (the product is zero when $u_1 = 1$), which contradicts $|u_1 u_2| < 1$; similarly, $u_1 \leq -1$ contradicts $(1 + u_1)(1 + u_2) > 0$. The same argument applies to $u_2$, hence $|u_1| < 1$ and $|u_2| < 1$, that is $|z_1| > 1$ and $|z_2| > 1$.
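As a sanity check, the triangle conditions can be compared with the root condition numerically; a small sketch (the helper function below is ours, not from any package):

```r
# Hypothetical helper: check AR(2) causality two ways
ar2_causal <- function(phi1, phi2) {
  roots <- polyroot(c(1, -phi1, -phi2))            # roots of phi(z)
  by_roots    <- all(Mod(roots) > 1)               # all roots outside the unit circle
  by_triangle <- (phi1 + phi2 < 1) && (phi2 - phi1 < 1) && (abs(phi2) < 1)
  c(by_roots = by_roots, by_triangle = by_triangle)
}

ar2_causal(1.3, -0.4)   # causal: both checks TRUE
ar2_causal(0.5,  0.6)   # not causal: both checks FALSE
```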


Now that we know the causal condition for an AR(2) model, the following example shows how to convert a causal AR(2) into a linear process.

Example: Convert AR(2) into a linear process Given an AR(2) model: $X_t = 1.3X_{t-1} - 0.4X_{t-2} + W_t$.

Step 1. The AR polynomial can be written as:

$$\phi(z) = 1 - 1.3z + 0.4z^2 = (1 - 0.5z)(1 - 0.8z),$$

it has roots $z_1 = 2 > 1$ and $z_2 = 1.25 > 1$. Thus we are able to convert it into a linear process.

Step 2. If an AR process $\phi(B)X_t = W_t$ is causal, we can write it as $X_t = \phi^{-1}(B)W_t$. This step is to invert $\phi(B)$.

$$\phi^{-1}(z) = \frac{1}{(1 - 0.5z)(1 - 0.8z)} = \frac{C_1}{(1 - 0.5z)} + \frac{C_2}{(1 - 0.8z)} = \frac{C_2(1 - 0.5z) + C_1(1 - 0.8z)}{(1 - 0.5z)(1 - 0.8z)},$$

Since the above equation is valid for any $z$, we can match coefficients and solve the following equations to get $C_1$ and $C_2$,

$$\begin{cases} C_1 + C_2 = 1 \\ -0.8C_1 - 0.5C_2 = 0 \end{cases} \implies \begin{cases} C_1 = -\frac{5}{3} \\ C_2 = \frac{8}{3} \end{cases}.$$

Thus, $$\phi^{-1}(z) = \frac{-5}{3(1 - 0.5z)} + \frac{8}{3(1 - 0.8z)}.$$

Step 3. Using Geometric series: $a \sum_{k = 0}^{\infty}r^k = \frac{a}{1-r}, \mbox{ if } |r| < 1$.

$$\begin{cases} \frac{-5}{3(1 - 0.5z)} = \frac{-5}{3} \sum_{j = 0}^{\infty}0.5^jz^j, \mbox{ if } |z| < 2 \\ \frac{8}{3(1 - 0.8z)} = \frac{8}{3} \sum_{j = 0}^{\infty}0.8^jz^j, \mbox{ if } |z| < 1.25 \end{cases}.$$

So, $$\phi^{-1}(z) = \sum_{j = 0}^{\infty} \left[ \frac{-5}{3} (0.5)^j + \frac{8}{3} (0.8)^j \right] z^j, \mbox{ if } |z| < 1.25.$$

Step 4. $$X_t = \phi^{-1}(B)W_t = \sum_{j = 0}^{\infty} \left[ \frac{-5}{3} (0.5)^j + \frac{8}{3} (0.8)^j \right] B^j W_t = \sum_{j = 0}^{\infty} \left[ \frac{-5}{3} (0.5)^j + \frac{8}{3} (0.8)^j \right] W_{t-j}.$$
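The closed form can be checked against stats::ARMAtoMA(), which computes the $\psi$-weights numerically; a quick sketch (note that ARMAtoMA() returns $\psi_1, \psi_2, \dots$ and omits $\psi_0 = 1$):

```r
# psi-weights of the causal AR(2): X_t = 1.3 X_{t-1} - 0.4 X_{t-2} + W_t
j <- 1:8
numeric_psi <- ARMAtoMA(ar = c(1.3, -0.4), lag.max = 8)
closed_form <- (-5/3) * 0.5^j + (8/3) * 0.8^j
cbind(numeric_psi, closed_form)   # the two columns agree
```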

Autocorrelation of AR models

Recall the definition of the ACF in Chapter 2. Although the ACF of certain non-causal AR models does exist, in this part we only focus on the ACF of causal AR models. Next we give an example with the ACF of an AR(2) model.

Example: ACF of AR(2) We use the same AR(2) model as in the previous example: $X_t = 1.3X_{t-1} - 0.4X_{t-2} + W_t$.

Step 1. Find the homogeneous difference equation with respect to the ACF $\rho(h)$.

As shown in the previous example, our model is causal. Following the procedure of (Ex. 3.9, p.99), we have $$\rho(h) - 1.3\rho(h-1) + 0.4\rho(h-2) = 0, \mbox{ } h = 1,2,...,$$ and the initial conditions are $\rho(0) = 1$ and $\rho(-1) = \frac{13}{14}$.

Step 2. Solve the roots of the AR polynomial.

The AR polynomial can be written as: $$\phi(z) = 1 - 1.3z + 0.4z^2 = (1 - 0.5z)(1 - 0.8z),$$

it has roots $z_1 = 2 > 1$ and $z_2 = 1.25 > 1$. Since $z_1$ and $z_2$ are real and distinct, the solution takes the form $$\rho(h) = c_1z_1^{-h} + c_2z_2^{-h}.$$

Step 3. Solve for $c_1$ and $c_2$ based on the two initial conditions. $$\begin{cases} \rho(0) = c_1 + c_2 = 1 \\ \rho(-1) = 2c_1 + 1.25c_2 = \frac{13}{14} \end{cases},$$ then we have $c_1 = -\frac{3}{7}$ and $c_2 = \frac{10}{7}$. Then the ACF is $$\rho(h) = -\frac{3}{7}2^{-h} + \frac{10}{7}\left(\frac{5}{4}\right)^{-h}.$$
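Again, the result can be verified numerically with stats::ARMAacf(); a quick sketch:

```r
# Theoretical ACF of X_t = 1.3 X_{t-1} - 0.4 X_{t-2} + W_t
h <- 0:10
numeric_acf <- ARMAacf(ar = c(1.3, -0.4), lag.max = 10)
closed_form <- (-3/7) * 2^(-h) + (10/7) * (5/4)^(-h)
cbind(numeric_acf, closed_form)   # the two columns agree
```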

Next we present an example showing how to derive the ACF of a general causal AR(p) model. It is much more complicated than what we did for the AR(2).

Example: ACF of AR(p) Recall that the AR(p) models can be formally represented as follows $${X_t} = {\phi_1}{X_{t - 1}} + ... + {\phi_p}{X_{t - p}} + {W_t},$$ where $\phi_p \neq 0$ and $W_t$ is a (Gaussian) white noise process with variance $\sigma^2$.

As shown in Chapter 3.4 of the textbook, there are two ways to derive the ACF of a general AR(p) model.

For the first method, since we assume our AR(p) model is causal, we can write it as a one-sided linear process: $X_t = \sum_{j = 0}^{\infty} \psi_j W_{t-j}$, and then the autocovariance function is $\gamma(h) = cov(X_{t+h},X_t) = \sigma^2 \sum_{j=0}^{\infty} \psi_j \psi_{j+h}$, for $h \geq 0$. The difficulty of this method is to solve for the $\psi$-weights. Treating the AR(p) as a special case of an ARMA(p,q), we can set the MA polynomial to $\theta(z) = 1$ and solve for the $\psi$-weights by matching the coefficients in $\phi(z)\psi(z) = \theta(z)$. The detailed procedure is described in Chapter 3.3 of the textbook.

For the second method, we can also use the procedure we used for the AR(2). That is, finding a homogeneous difference equation with respect to the ACF $\rho(h)$ and solving it directly. To do so we need the following steps.

Step 1. Find the homogeneous difference equation with respect to the ACF $\rho(h)$.

First verify whether our model is causal. If it is, then we have $$\rho(h) - \phi_1\rho(h-1) - \cdots - \phi_p\rho(h-p) = 0, \mbox{ } h \geq p.$$

Step 2. Solve the roots of the associated AR polynomial.

The polynomial can be written as: $$\phi(z) = 1 - \phi_1z - \cdots - \phi_pz^p.$$

Suppose the polynomial has $r$ distinct roots, and let $m_j$ denote the multiplicity of $z_j$, such that $m_1 + m_2 + \cdots + m_r = p$.

Then the general solution of the homogeneous difference equation is $$\rho(h) = z_1^{-h}P_1(h) + z_2^{-h}P_2(h) + \cdots + z_r^{-h}P_r(h), \mbox{ } h \geq p,$$ where $P_j(h)$ is a polynomial in $h$ of degree $m_j - 1$.

Step 3. Solve every $P_j(h)$ based on $p$ given initial conditions on $\rho(h)$.

Remark Since the AR(p) model is causal, the roots obtained in Step 2 lie outside of the unit circle (i.e. $|z_i| > 1$, for $i = 1, \cdots, r$). Then the absolute value of the general solution satisfies $$|\rho(h)| = \left|\frac{P_1(h)}{z_1^{h}} + \frac{P_2(h)}{z_2^{h}} + \cdots + \frac{P_r(h)}{z_r^{h}}\right| \leq \left|\frac{P_1(h)}{z_1^{h}}\right| + \left| \frac{P_2(h)}{z_2^{h}}\right| + \dots + \left|\frac{P_r(h)}{z_r^{h}}\right| \leq \frac{r \max_{j = 1, \dots, r}\left|P_j(h)\right|}{\min_{j = 1, \dots, r} \left|z_j \right|^{h}},$$

and from the right-hand side of the last inequality we can see that the rate of convergence is dominated by $\frac{1}{\min_{j = 1, \dots, r} \left|z_j \right|^{h}}$. Thus the ACF $\rho(h)$ goes to zero exponentially fast as $h \to \infty$.

Remark From the example and remark above, we know that a causal AR(p) can be written as a one-sided linear process, $X_t = \sum_{j = 0}^{\infty} \psi_j W_{t-j}$, with autocovariance function $\gamma(h) = cov(X_{t+h},X_t) = \sigma^2 \sum_{j=0}^{\infty} \psi_j \psi_{j+h}$, for $h \geq 0$, and that the ACF $\rho(h)$ of an AR(p) goes to zero exponentially fast as $h \to \infty$.

Now, suppose we have another one-sided linear process, $X_t = \sum_{j = 0}^{q} \psi_j W_{t-j}$, truncated at order q. Then for $h>0$, $X_{t+h} = \sum_{j = 0}^{q} \psi_j W_{t+h-j}$, and the autocovariance function is $$\gamma(h) = cov(X_{t+h},X_t) = \begin{cases} \sigma^2 \sum_{j=0}^{q-h} \psi_j \psi_{j+h} & 1 \leq h \leq q \\ 0 & h > q \end{cases}.$$ This shows that such a truncated one-sided linear process (an MA(q) model) always has ACF $\rho(h) = 0$ when $h > q$. Thus the ACF is a very informative tool to identify the order q of the truncated one-sided linear process (MA(q) model). However, a causal AR(p) is a one-sided linear process of infinite order, so the ACF does not tell us anything about the order of an AR model.

However, a statistic similar to the ACF, called the partial autocorrelation function (PACF), is quite informative about the order of an AR(p).

Partial autocorrelation of AR models

Definition (Partial autocorrelation of AR models) For a stationary process $X_t$, the partial autocorrelation function (PACF), denoted $\phi_{hh}$ for $h = 1, 2, \dots,$ is given by $$\phi_{11} = \corr(X_{t+1}, X_t) = \rho(1),$$ and $$\phi_{hh} = \corr(X_{t+h} - \hat{X}_{t+h}, X_t - \hat{X}_t), \mbox{ } h \geq 2.$$

Remark From the above definition, we can think of the partial autocorrelation $\phi_{hh}$ as the correlation between the residual of $X_{t+h}$, after removing its best linear predictor on the vector space spanned by $\{ X_{t+1}, \dots, X_{t+h-1} \}$, and the residual of $X_t$ after removing its best linear predictor on the same vector space. Similar to linear regression, after the projection these residuals are uncorrelated with the vector space spanned by $\{ X_{t+1}, \dots, X_{t+h-1} \}$.

We will discuss the best linear predictor in the section on forecasting; here we just mention how to obtain it. Given a time series $X_1, X_2, \dots, X_t$ with zero mean, the best linear predictor of $X_{t+h}$ can be written as $\hat{X}_{t+h} = \sum_{j=1}^t \alpha_j X_j$ such that it satisfies the prediction equations: $$\mathbb{E}(X_{t+h} - \hat{X}_{t+h}) = 0,$$ and $$\mathbb{E} [(X_{t+h} - \hat{X}_{t+h})X_j ] = 0, \mbox{ for } j = 1, \dots, t.$$ According to the projection theorem, satisfying the prediction equations is equivalent to minimizing the mean square error $\mathbb{E}(X_{t+h} - \hat{X}_{t+h})^2$. Thus we have two equivalent ways to obtain the best linear predictor.

Example: PACF of AR(1) Consider a causal AR(1) model ${X_t} = \phi {X_{t - 1}} + {W_t}$. We have $$\phi_{11} = \corr(X_{t+1}, X_t) = \rho(1) = \phi,$$ and $$\phi_{22} = \corr(X_{t+2} - \hat{X}_{t+2}, X_t - \hat{X}_t).$$ Next, we find $\hat{X}_t$ and $\hat{X}_{t+2}$. Since $X_{t+1}$ is the only random variable between $X_t$ and $X_{t+2}$, $\hat{X}_t$ and $\hat{X}_{t+2}$ are the best linear predictors on the vector space spanned by $X_{t+1}$, and we can obtain them by minimizing the MSE: $$\mathbb{E}(X_{t+2} - \hat{X}_{t+2})^2 = \mathbb{E}(X_{t+2} - \beta_1 X_{t+1})^2 = \gamma(0) - 2\beta_1 \gamma(1) + \beta_1^2 \gamma(0).$$ Then by minimizing the MSE, we have $\beta_1 = \frac{\gamma(1)}{\gamma(0)} = \phi$.

Similarly, by minimizing $$\mathbb{E}(X_{t} - \hat{X}_{t})^2 = \mathbb{E}(X_{t} - \beta_2 X_{t+1})^2 = \gamma(0) - 2\beta_2 \gamma(1) + \beta_2^2 \gamma(0),$$ we have $\beta_2 = \frac{\gamma(1)}{\gamma(0)} = \phi$.

Or equivalently we can use the prediction equations. We have $$\mathbb{E}[(X_{t+2} - \hat{X}_{t+2})X_{t+1}] = \mathbb{E}[X_{t+2}X_{t+1} - \beta_1 X_{t+1}^2] = \gamma(1) - \beta_1 \gamma(0) = 0,$$ and $$\mathbb{E}[(X_{t} - \hat{X}_{t})X_{t+1}] = \mathbb{E}[X_{t}X_{t+1} - \beta_2 X_{t+1}^2] = \gamma(1) - \beta_2 \gamma(0) = 0,$$ which gives the same solutions.

Therefore, $$\phi_{22} = \corr(X_{t+2} - \phi X_{t+1}, X_t - \phi X_{t+1}) = \corr(W_{t+2}, X_t - \phi X_{t+1}) = 0,$$ note that the last equality is based on causality.

Example: PACF of AR(p) In this example we show that the PACF characterizes the order of AR(p) models. That is, when $h > p$, the PACF $\phi_{hh} = 0$. Consider a causal AR(p) model, $X_{t+h} = \sum_{j=1}^p \phi_j X_{t+h-j} + W_{t+h}$; we want to calculate $\phi_{hh}$.

The best linear predictor of $X_{t+h}$ is $$\hat{X}_{t+h} = \mathbb{E}\left[X_{t+h}|X_t, \dots, X_{t+h-1}\right] = \mathbb{E}\left[\sum_{j=1}^p \phi_j X_{t+h-j} + W_{t+h}\Big|X_t, \dots, X_{t+h-1}\right] = \sum_{j=1}^p \mathbb{E}\left[\phi_j X_{t+h-j}|X_t, \dots, X_{t+h-1}\right] = \sum_{j=1}^p \phi_j X_{t+h-j}.$$ Thus when $h > p$, $$\phi_{hh} = corr(X_{t+h} - \hat{X}_{t+h}, X_t - \hat{X}_t) = corr(W_{t+h}, X_t - \hat{X}_t) = 0.$$

Therefore, when the lag is greater than p the PACF is zero.

In summary, we have two general results to identify time series models.

  1. If the ACF is zero after a certain lag $q$, it suggests an MA(q) model for the time series. In contrast, the ACF of any AR(p) process only decays to zero as the lag increases.
  2. If the PACF is zero after a certain lag $p$, it suggests an AR(p) model for the time series. In contrast, the PACF of any MA(q) process only decays to zero as the lag increases.

A quick numerical illustration is given below.
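These two rules can be illustrated with theoretical ACF/PACF values from stats::ARMAacf(); a quick sketch (the particular coefficients are arbitrary):

```r
# AR(2): ACF decays, PACF cuts off after lag 2
round(ARMAacf(ar = c(1.3, -0.4), lag.max = 6), 3)
round(ARMAacf(ar = c(1.3, -0.4), lag.max = 6, pacf = TRUE), 3)

# MA(2): ACF cuts off after lag 2, PACF decays
round(ARMAacf(ma = c(0.8, 0.4), lag.max = 6), 3)
round(ARMAacf(ma = c(0.8, 0.4), lag.max = 6, pacf = TRUE), 3)
```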

Estimation of Parameters

Consider a time series given by $x_t \sim ARMA(p,q)$. This gives us a parameter vector $\vec \varphi$ in the parameter space $\Omega$ of the form:

$$\vec \varphi = \left[ {\begin{array}{c} {\phi _1} \\ \vdots \\ {\phi _p} \\ {\theta _1} \\ \vdots \\ {\theta _q} \\ {\sigma ^2} \end{array}} \right]$$

In order to estimate this parameter space, we must assume the following three conditions:

  1. The process is causal
  2. The process is invertible
  3. The process has Gaussian innovations.

Innovations are the time series analogue of residuals. That is, an innovation is given by ${x_t} - \hat x_t^{t - 1}$, where $\hat x_t^{t - 1}$ is the prediction at time $t$ given $t-1$ observations and ${x_t}$ is the true value observed at time $t$.

There are two main ways of performing such an estimation of the parameter space.

  1. Maximum Likelihood / Least Squares Estimation [MLE / LSE]
  2. Method of Moments (MoM)

To begin, we'll explore using the MLE to perform the estimation.

Maximum Likelihood Estimation

Definition Consider $X_n = (X_1, X_2, \ldots, X_n)$ with the joint density $f(X_1, X_2, \ldots, X_n ; \theta)$ where $\theta \in \Theta$. Given $X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n$ is observed, we have the likelihood function of $\theta$ as

$$L(\theta) = L(\theta|x_1,x_2, \ldots, x_n) = f(x_1,x_2, \ldots, x_n | \theta)$$

If the $X_i$ are iid, then the likelihood simplifies to:

[L(\theta) = \prod\limits_{i = 1}^n {f\left( {{x_i}|\theta } \right)} ]

However, that's a bit painful to maximize with calculus. So, we opt to use the log of the function since derivatives are easier and the logarithmic function is always increasing. Thus, we traditionally use:

$$l\left( \theta \right) = \log \left( {L\left( \theta \right)} \right) = \sum\limits_{i = 1}^n {\log \left( {f\left( {{x_i}|\theta } \right)} \right)}$$

By maximizing the likelihood function $L(\theta)$, we get the maximum likelihood estimate (MLE) of $\theta$. So, we end up with a value that makes the observed data the "most probable."

Note: The likelihood function is not a probability density function.

$AR(1)$ with mean $\mu$

Consider an $AR(1)$ process given as $y_t = \phi y_{t-1} + w_t$, ${w_t}\mathop \sim \limits^{iid} N\left( {0,{\sigma ^2}} \right)$, with $E\left[ {{y_t}} \right] = 0$, $\left| \phi \right| < 1$.

Let $x_t = y_t + \mu$, so that $E\left[ {{x_t}} \right] = \mu$.

Then, ${x_t} - \mu = {y_t}$. Substituting in for $y_t$, we get:

$$\begin{aligned} y_t &= \phi y_{t-1} + w_t \\ \underbrace {\left( {{x_t} - \mu } \right)}_{ = {y_t}} &= \phi \underbrace {\left( {{x_{t - 1}} - \mu } \right)}_{ = {y_{t-1}}} + {w_t} \\ {x_t} &= \mu + \phi \left( {{x_{t - 1}} - \mu } \right) + {w_t} \end{aligned}$$

In this case, $x_t$ is an $AR(1)$ process with mean $\mu$.

This means that we have:

  1. $E\left[ {{x_t}} \right] = \mu$
  2. $$\begin{aligned} Var\left( {{x_t}} \right) &= Var\left( {{x_t} - \mu } \right) = Var\left( {{y_t}} \right) = Var\left( {\sum\limits_{j = 0}^\infty {{\phi ^j}{w_{t - j}}} } \right) \\ &= \sum\limits_{j = 0}^\infty {{\phi ^{2j}}Var\left( {{w_{t - j}}} \right)} = {\sigma ^2}\sum\limits_{j = 0}^\infty {{\phi ^{2j}}} = \frac{{{\sigma ^2}}}{{1 - {\phi ^2}}},{\text{ since }}\left| \phi \right| < 1{\text{ and }}\sum\limits_{k = 0}^\infty {a{r^k}} = \frac{a}{{1 - r}} \end{aligned}$$

So, $x_t \sim N\left({ \mu, \frac{{{\sigma ^2}}}{{1 - {\phi ^2}}} }\right)$.

Note that the distribution of $x_t$ is normal and, thus, the density function of $x_t$ is given by: $$\begin{aligned} f\left( {{x_t}} \right) &= \sqrt {\frac{1 - \phi^2}{2\pi \sigma^2}} \exp \left( - \frac{1}{2} \cdot \frac{1 - \phi^2}{\sigma^2} \cdot \left( x_t - \mu \right)^2 \right) \\ &= \left( 2\pi \right)^{-\frac{1}{2}}\left( \sigma^2 \right)^{-\frac{1}{2}}\left( 1 - \phi^2 \right)^{\frac{1}{2}}\exp \left( - \frac{1}{2} \cdot \frac{1 - \phi^2}{\sigma^2} \cdot \left( x_t - \mu \right)^2 \right) \textrm{ [1]} \end{aligned}$$

We'll call the last equation [1].

Conditional distribution of $x_t | x_{t-1}$

Now, consider $x_t | x_{t-1}$ for $t > 1$.

The mean is given by:

$$\begin{aligned} E\left[ {{x_t}|{x_{t - 1}}} \right] &= E\left[ {\mu + \phi \left( {{x_{t - 1}} - \mu } \right) + {w_t}|{x_{t - 1}}} \right] \\ &= \mu + \phi \left( {{x_{t - 1}} - \mu } \right) \end{aligned}$$

This is the case since $E\left[ {{x_{t - 1}}|{x_{t - 1}}} \right] = {x_{t - 1}}$ and $E\left[ {{w_t}|{x_{t - 1}}} \right] = 0$

Now, the variance is:

$$\begin{aligned} Var\left( {{x_t}|{x_{t - 1}}} \right) &= Var\left( {\mu + \phi \left( {{x_{t - 1}} - \mu } \right) + {w_t}|{x_{t - 1}}} \right) \\ &= \underbrace {Var\left( {\mu + \phi \left( {{x_{t - 1}} - \mu } \right)|{x_{t - 1}}} \right)}_{ = 0} + Var\left( {{w_t}|{x_{t - 1}}} \right) \\ &= Var\left( {{w_t}} \right) \\ &= {\sigma ^2} \end{aligned}$$

Thus, we have: ${x_t}|{x_{t-1}} \sim N\left( {\mu + \phi \left( {{x_{t - 1}} - \mu } \right),{\sigma ^2}} \right)$.

Again, note that this conditional distribution is normal and, thus, the density function of $x_t | x_{t-1}$ is given by: $$\begin{aligned} f\left( {{x_t}|{x_{t-1}}} \right) &= \sqrt {\frac{1}{2\pi \sigma^2}} \exp \left( - \frac{1}{2\sigma^2} \cdot \left[ \left( x_t - \mu \right) - \phi \left( x_{t - 1} - \mu \right) \right]^2 \right) \\ &= \left( 2\pi \right)^{-\frac{1}{2}}\left( \sigma^2 \right)^{-\frac{1}{2}}\exp \left( - \frac{1}{2\sigma^2} \cdot \left[ \left( x_t - \mu \right) - \phi \left( x_{t - 1} - \mu \right) \right]^2 \right) \textrm{ [2]} \end{aligned}$$

We'll call this equation [2].

MLE for $\sigma ^2$ on $AR(1)$ with mean $\mu$

Whew, with all of the above said, we're now ready to obtain an MLE estimate on an $AR(1)$.

Let $\vec{\theta} = \left[ {\begin{array}{c} \mu \\ \phi \\ {\sigma ^2} \end{array}} \right]$; then the likelihood of $\vec{\theta}$ given $x_1, \ldots , x_T$ is:

$$\begin{aligned} L\left( {\vec \theta |{x_1}, \ldots ,{x_T}} \right) &= f\left( {{x_1}, \ldots ,{x_T}|\vec \theta } \right) \\ &= f\left( {{x_1}} \right) \cdot \prod\limits_{t = 2}^T {f\left( {{x_t}|{x_{t - 1}}} \right)} \end{aligned}$$

The last equality is the result of us using a lag 1 of "memory." Also, note that $x_t | x_{t-1}$ requires $t > 1$. Furthermore, we have dropped the parameters in the densities, e.g. $\vec{\theta}$ in $f(\cdot)$, to ease notation.

Using equations [1] and [2], we have:

$$L\left( {\vec \theta |{x_1}, \ldots ,{x_T}} \right) = {\left( {2\pi } \right)^{ - \frac{T}{2}}}{\left( {{\sigma ^2}} \right)^{ - \frac{T}{2}}}{\left( {1 - {\phi ^2}} \right)^{\frac{1}{2}}}\exp \left( { - \frac{1}{{2{\sigma ^2}}}\left[ {\left( {1 - {\phi ^2}} \right){{\left( {{x_1} - \mu } \right)}^2} + \sum\limits_{t = 2}^T {{{\left[ {\left( {{x_t} - \mu } \right) - \phi \left( {{x_{t - 1}} - \mu } \right)} \right]}^2}} } \right]} \right)$$

For convenience, we'll define:

$$S\left( {\mu ,\phi } \right) = \left( {1 - {\phi ^2}} \right){\left( {{x_1} - \mu } \right)^2} + \sum\limits_{t = 2}^T {{{\left[ {\left( {{x_t} - \mu } \right) - \phi \left( {{x_{t - 1}} - \mu } \right)} \right]}^2}}$$

Fun fact, this is called the "unconditional sum of squares."

Thus, we will operate on:

$$L\left( {\vec \theta |{x_1}, \ldots ,{x_T}} \right) = {\left( {2\pi } \right)^{ - \frac{T}{2}}}{\left( {{\sigma ^2}} \right)^{ - \frac{T}{2}}}{\left( {1 - {\phi ^2}} \right)^{\frac{1}{2}}}\exp \left( { - \frac{1}{{2{\sigma ^2}}}S\left( {\mu ,\phi } \right)} \right)$$

Taking the log of this yields:

$$\begin{aligned} l\left( {\vec \theta |{x_1}, \ldots ,{x_T}} \right) &= \log \left( {L\left( {\vec \theta |{x_1}, \ldots ,{x_T}} \right)} \right) \\ &= - \frac{T}{2}\log \left( {2\pi } \right) - \frac{T}{2}\log \left( {{\sigma ^2}} \right) + \frac{1}{2}\log\left( {1 - {\phi ^2}} \right) - \frac{1}{{2{\sigma ^2}}}S\left( {\mu ,\phi } \right) \end{aligned}$$

Now, taking the derivative and solving for the maximized point gives:

$$\begin{aligned} \frac{\partial }{{\partial {\sigma ^2}}}l\left( {\vec \theta |{x_1}, \ldots ,{x_T}} \right) &= - \frac{T}{{2{\sigma ^2}}} + \frac{1}{{2{\sigma ^4}}}S\left( {\mu ,\phi } \right) \\ 0 &= - \frac{T}{{2{\sigma ^2}}} + \frac{1}{{2{\sigma ^4}}}S\left( {\mu ,\phi } \right) \\ \frac{T}{{2{\sigma ^2}}} &= \frac{1}{{2{\sigma ^4}}}S\left( {\mu ,\phi } \right) \\ {\sigma ^2} &= \frac{1}{T}S\left( {\mu ,\phi } \right) \end{aligned}$$

Thus, the MLE is ${\hat \sigma }^2 = \frac{1}{T}S\left( {\hat \mu ,\hat \phi } \right)$, where $\hat \mu$ and $\hat \phi$ are the MLEs of $\mu$ and $\phi$, obtained numerically via either Newton-Raphson or a scoring algorithm. (More details can be found in a numerical recipes book.)
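In practice this numerical maximization is handled by stats::arima(); a minimal sketch on simulated data (the true values μ = 5, φ = 0.8, σ² = 1 are arbitrary choices):

```r
# Fit an AR(1) with mean by full maximum likelihood
set.seed(42)
xt <- arima.sim(model = list(ar = 0.8), n = 500) + 5   # AR(1) with mean 5
fit_ml <- arima(xt, order = c(1, 0, 0), method = "ML")
fit_ml   # "ar1" estimates phi; "intercept" is the estimated mean; sigma^2 is also reported
```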

Conditional MLE on $AR(1)$ with mean $\mu$

A common strategy to reduce the dependency on numerical recipes is to simplify $l\left( {\vec \theta |{x_1}, \ldots ,{x_T}} \right)$ by using ${l^*}\left( {\vec \theta |{x_1}, \ldots ,{x_T}} \right)$:

$$\begin{aligned} {l^*}\left( {\vec \theta |{x_1}, \ldots ,{x_T}} \right) &= \sum\limits_{t = 2}^T {\log \left( {f\left( {{x_t}|{x_{t - 1}}} \right)} \right)} \\ &= \sum\limits_{t = 2}^T {\log \left( {{{\left( {2\pi } \right)}^{ - \frac{1}{2}}}{{\left( {{\sigma ^2}} \right)}^{ - \frac{1}{2}}}\exp \left( { - \frac{1}{{2{\sigma ^2}}} \cdot {{\left[ {\left( {{x_t} - \mu } \right) - \phi \left( {{x_{t - 1}} - \mu } \right)} \right]}^2}} \right)} \right)} \\ &= - \frac{{T - 1}}{2}\log \left( {2\pi } \right) - \frac{{T - 1}}{2}\log \left( {{\sigma ^2}} \right) - \frac{1}{{2{\sigma ^2}}}\sum\limits_{t = 2}^T {{{\left[ {\left( {{x_t} - \mu } \right) - \phi \left( {{x_{t - 1}} - \mu } \right)} \right]}^2}} \end{aligned}$$

Again, for convenience, we'll define:

$${S_c}\left( {\mu ,\phi } \right) = \sum\limits_{t = 2}^T {{{\left[ {\left( {{x_t} - \mu } \right) - \phi \left( {{x_{t - 1}} - \mu } \right)} \right]}^2}}$$

Fun fact, this is called the "conditional sum of squares."

So, we will use:

$${l^*}\left( {\vec \theta |{x_1}, \ldots ,{x_T}} \right) = - \frac{{T - 1}}{2}\log \left( {2\pi } \right) - \frac{{T - 1}}{2}\log \left( {{\sigma ^2}} \right) - \frac{1}{{2{\sigma ^2}}}{S_c}\left( {\mu ,\phi } \right)$$

Taking the derivative with respect to $\mu$ gives:

$$\begin{aligned} \frac{\partial }{{\partial \mu }}{l^*}\left( {\vec \theta |{x_1}, \ldots ,{x_T}} \right) &= - \frac{1}{{2{\sigma ^2}}}\sum\limits_{t = 2}^T {2\left[ {\left( {{x_t} - \mu } \right) - \phi \left( {{x_{t - 1}} - \mu } \right)} \right]\left( {\phi - 1} \right)} \\ &= \frac{{1 - \phi }}{{{\sigma ^2}}}\sum\limits_{t = 2}^T {\left[ {\left( {{x_t} - \mu } \right) - \phi \left( {{x_{t - 1}} - \mu } \right)} \right]} \\ &= \frac{{1 - \phi }}{{{\sigma ^2}}}\sum\limits_{t = 2}^T {\left( {{x_t} - \phi {x_{t - 1}} - \mu \left( {1 - \phi } \right)} \right)} \\ &= -\frac{{{{\left( {1 - \phi } \right)}^2}}}{{{\sigma ^2}}}\mu \left( {T - 1} \right) + \frac{{\left( {1 - \phi } \right)}}{{{\sigma ^2}}}\sum\limits_{t = 2}^T {\left( {{x_t} - \phi {x_{t - 1}}} \right)} \end{aligned}$$

Solving for $\mu^{*}$ gives:

$$\begin{aligned} 0 &= \frac{\partial }{{\partial \mu }}{l^*}\left( {\vec \theta |{x_1}, \ldots ,{x_T}} \right) \\ 0 &= - \frac{{{{\left( {1 - {\phi ^*}} \right)}^2}}}{{{\sigma ^2}}}{\mu ^*}\left( {T - 1} \right) + \frac{{\left( {1 - {\phi ^*}} \right)}}{{{\sigma ^2}}}\sum\limits_{t = 2}^T {\left( {{x_t} - {\phi ^*}{x_{t - 1}}} \right)} \\ \frac{{{{\left( {1 - {\phi ^*}} \right)}^2}}}{{{\sigma ^2}}}{\mu ^*}\left( {T - 1} \right) &= \frac{{\left( {1 - {\phi ^*}} \right)}}{{{\sigma ^2}}}\sum\limits_{t = 2}^T {\left( {{x_t} - {\phi ^*}{x_{t - 1}}} \right)} \\ {\mu ^*}\left( {1 - {\phi ^*}} \right)\left( {T - 1} \right) &= \sum\limits_{t = 2}^T {\left( {{x_t} - {\phi ^*}{x_{t - 1}}} \right)} \\ {\mu ^*} &= \frac{1}{{\left( {1 - {\phi ^*}} \right)\left( {T - 1} \right)}}\sum\limits_{t = 2}^T {\left( {{x_t} - {\phi ^*}{x_{t - 1}}} \right)} \\ {\mu ^*} &= \frac{1}{{1 - {\phi ^*}}}\left[ {\underbrace {\frac{1}{{T - 1}}\sum\limits_{t = 2}^T {{x_t}} }_{ = {{\bar x}_{\left( 2 \right)}}} - {\phi ^*}\underbrace {\frac{1}{{T - 1}}\sum\limits_{t = 2}^T {{x_{t - 1}}} }_{ = {{\bar x}_{\left( 1 \right)}}}} \right] \\ {{\hat \mu }^*} &= \frac{1}{{1 - {{\hat \phi }^*}}}\left( {{{\bar x}_{\left( 2 \right)}} - {{\hat \phi }^*}{{\bar x}_{\left( 1 \right)}}} \right) \end{aligned}$$

When $T$ is large, we have the following:

$$\begin{aligned} {{\bar x}_{\left( 1 \right)}} &\approx \bar x, \quad {{\bar x}_{\left( 2 \right)}} \approx \bar x \\ {{\hat \mu }^*} &= \frac{1}{{1 - {{\hat \phi }^*}}}\left( {\bar x - {{\hat \phi }^*}\bar x} \right) = \frac{{\bar x}}{{1 - {{\hat \phi }^*}}}\left( {1 - {{\hat \phi }^*}} \right) = \bar x \end{aligned}$$

Taking the derivative with respect to $\sigma^2$ and solving for $\sigma^2$ gives: $$\begin{aligned} \frac{\partial }{{\partial {\sigma ^2}}}{l^*}\left( {\vec \theta |{x_1}, \ldots ,{x_T}} \right) &= - \frac{{T - 1}}{{2{\sigma ^2}}} + \frac{1}{{2{\sigma ^4}}}{S_c}\left( {\mu ,\phi } \right) \\ 0 &= - \frac{{T - 1}}{{2{\sigma ^2}}} + \frac{1}{{2{\sigma ^4}}}{S_c}\left( {\mu ,\phi } \right) \\ \frac{{T - 1}}{{2{\sigma ^2}}} &= \frac{1}{{2{\sigma ^4}}}{S_c}\left( {\mu ,\phi } \right) \\ {{\hat \sigma }^{2*}} &= \frac{1}{{T - 1}}{S_c}\left( {{{\hat \mu }^*},{{\hat \phi }^*}} \right) \end{aligned}$$

Taking the derivative with respect to $\phi$ gives: $$\begin{aligned} \frac{\partial }{{\partial \phi }}{l^*}\left( {\vec \theta |{x_1}, \ldots ,{x_T}} \right) &= - \frac{1}{{2{\sigma ^2}}}\sum\limits_{t = 2}^T { - 2\left[ {\left( {{x_t} - \mu } \right) - \phi \left( {{x_{t - 1}} - \mu } \right)} \right]\left( {{x_{t - 1}} - \mu } \right)} \\ &= \frac{1}{{{\sigma ^2}}}\sum\limits_{t = 2}^T {\left[ {{x_t} - \phi {x_{t - 1}} - \mu \left( {1 - \phi } \right)} \right]\left( {{x_{t - 1}} - \mu } \right)} \\ &= \frac{1}{{{\sigma ^2}}}\sum\limits_{t = 2}^T {\left[ {{x_t}{x_{t - 1}} - \phi x_{t - 1}^2 - \mu \left( {1 - \phi } \right){x_{t - 1}} - \mu {x_t} + \mu \phi {x_{t - 1}} + {\mu ^2}\left( {1 - \phi } \right)} \right]} \\ &= \frac{1}{{{\sigma ^2}}}\left[ \sum\limits_{t = 2}^T {{x_t}{x_{t - 1}}} - \phi \sum\limits_{t = 2}^T {x_{t - 1}^2} - \mu \left( {1 - \phi } \right)\left( {T - 1} \right){{\bar x}_{\left( 1 \right)}} - \mu \left( {T - 1} \right){{\bar x}_{\left( 2 \right)}} + \phi \mu \left( {T - 1} \right){{\bar x}_{\left( 1 \right)}} + {\mu ^2}\left( {1 - \phi } \right)\left( {T - 1} \right) \right] \end{aligned}$$

Solving for $\phi$ gives: $$\begin{aligned} 0 &= \frac{\partial }{{\partial \phi }}{l^*}\left( {\vec \theta |{x_1}, \ldots ,{x_T}} \right) \\ 0 &= \sum\limits_{t = 2}^T {{x_t}{x_{t - 1}}} - {{\hat \phi }^*}\sum\limits_{t = 2}^T {x_{t - 1}^2} - \left( {{{\bar x}_{\left( 2 \right)}} - {{\hat \phi }^*}{{\bar x}_{\left( 1 \right)}}} \right)\left( {T - 1} \right){{\bar x}_{\left( 1 \right)}} - \frac{{{{\bar x}_{\left( 2 \right)}} - {{\hat \phi }^*}{{\bar x}_{\left( 1 \right)}}}}{{1 - {{\hat \phi }^*}}}\left( {T - 1} \right){{\bar x}_{\left( 2 \right)}} \\ &\quad + {{\hat \phi }^*}\frac{{{{\bar x}_{\left( 2 \right)}} - {{\hat \phi }^*}{{\bar x}_{\left( 1 \right)}}}}{{1 - {{\hat \phi }^*}}}\left( {T - 1} \right){{\bar x}_{\left( 1 \right)}} + {\left( {\frac{{{{\bar x}_{\left( 2 \right)}} - {{\hat \phi }^*}{{\bar x}_{\left( 1 \right)}}}}{{1 - {{\hat \phi }^*}}}} \right)^2}\left( {1 - {{\hat \phi }^*}} \right)\left( {T - 1} \right) \\ &\;\;\vdots \\ &\;\;{\text{Magic}} \\ &\;\;\vdots \\ {{\hat \phi }^*} &= \frac{{\sum\limits_{t = 2}^T {\left( {{x_t} - {{\bar x}_{\left( 2 \right)}}} \right)\left( {{x_{t-1}} - {{\bar x}_{\left( 1 \right)}}} \right)} }}{{\sum\limits_{t = 2}^T {{{\left( {{x_{t - 1}} - {{\bar x}_{\left( 1 \right)}}} \right)}^2}} }} \end{aligned}$$

When $T$ is large, we have:

$$\begin{aligned} \sum\limits_{t = 2}^T {\left( {{x_t} - {{\bar x}_{\left( 2 \right)}}} \right)\left( {{x_{t-1}} - {{\bar x}_{\left( 1 \right)}}} \right)} &\approx \sum\limits_{t = 2}^T {\left( {{x_t} - \bar x} \right)\left( {{x_{t - 1}} - \bar x} \right)} \\ \sum\limits_{t = 2}^T {{{\left( {{x_{t - 1}} - {{\bar x}_{\left( 1 \right)}}} \right)}^2}} &\approx \sum\limits_{t = 1}^T {{{\left( {{x_t} - \bar x} \right)}^2}} \\ {{\hat \phi }^*} &= \frac{{\sum\limits_{t = 2}^T {\left( {{x_t} - {{\bar x}_{\left( 2 \right)}}} \right)\left( {{x_{t-1}} - {{\bar x}_{\left( 1 \right)}}} \right)} }}{{\sum\limits_{t = 2}^T {{{\left( {{x_{t - 1}} - {{\bar x}_{\left( 1 \right)}}} \right)}^2}} }} \approx \frac{{\sum\limits_{t = 2}^T {\left( {{x_t} - \bar x} \right)\left( {{x_{t - 1}} - \bar x} \right)} }}{{\sum\limits_{t = 1}^T {{{\left( {{x_t} - \bar x} \right)}^2}} }} = \hat \rho \left( 1 \right) \end{aligned}$$
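The large-T approximations can be checked empirically: the conditional (CSS) estimates of $\mu$ and $\phi$ should be close to the sample mean and the lag-one sample autocorrelation. A small sketch, reusing the simulated series from the earlier MLE example:

```r
# Conditional-sum-of-squares fit versus the large-T approximations
fit_css <- arima(xt, order = c(1, 0, 0), method = "CSS")
c(phi_css = unname(fit_css$coef["ar1"]),
  mu_css  = unname(fit_css$coef["intercept"]))
c(phi_approx = acf(xt, plot = FALSE)$acf[2],   # rho_hat(1)
  mu_approx  = mean(xt))                       # x_bar
```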

Method of Moments

The goal behind the estimation with Method of Moments is to match the theoretical moment (e.g. $E\left[ {x_t^k} \right]$) with the sample moment (e.g $\frac{1}{n}\sum\limits_{i = 1}^n {x_i^k}$), where $k$ denotes the moment.

This method often leads to suboptimal estimates for general ARMA models. However, it works very well (and is asymptotically efficient) for AR(p) models.

Method of Moments - AR(p)

Consider an $AR(p)$ process represented by: $${x_t} = {\phi_1}{x_{t - 1}} + \cdots + {\phi_p}{x_{t - p}} + {w_t},$$ where $w_t \sim N(0,\sigma^2)$.

To begin, we find the covariance of the process when $h > 0$: $$\begin{aligned} Cov\left( {{x_{t + h}},{x_t}} \right) &\mathop = \limits^{\left( {h > 0} \right)} Cov\left( {{\phi_1}{x_{t + h - 1}} + \cdots + {\phi_p}{x_{t + h - p}} + {w_{t + h}},{x_t}} \right) \\ &= {\phi_1}Cov\left( {{x_{t + h - 1}},{x_t}} \right) + \cdots + {\phi_p}Cov\left( {{x_{t + h - p}},{x_t}} \right) + Cov\left( {{w_{t + h}},{x_t}} \right) \\ &= {\phi _1}\gamma \left( {h - 1} \right) + \cdots + {\phi _p}\gamma \left( {h - p} \right) \end{aligned}$$

Now, we turn our attention to the variance of the process:

$$\begin{aligned} Var\left( {{w_t}} \right) &= Cov\left( {{w_t},{w_t}} \right) \\ &= Cov\left( {{w_t},{w_t}} \right) + \underbrace {Cov\left( {{\phi_1}{x_{t - 1}},{w_t}} \right)}_{ = 0} + \cdots + \underbrace {Cov\left( {{\phi _p}{x_{t - p}},{w_t}} \right)}_{ = 0} \\ &= Cov\left( {\underbrace {{\phi _1}{x_{t - 1}} + \cdots + {\phi_p}{x_{t - p}} + {w_t}}_{ = {x_t}},{w_t}} \right) \\ &= Cov\left( {{x_t},{w_t}} \right) \\ &= Cov\left( {{x_t},{x_t} - {\phi_1}{x_{t - 1}} - \cdots - {\phi_p}{x_{t - p}}} \right) \\ &= Cov\left( {{x_t},{x_t}} \right) - {\phi _1}Cov\left( {{x_t},{x_{t - 1}}} \right) - \cdots - {\phi_p}Cov\left( {{x_t},{x_{t - p}}} \right) \\ &= \gamma \left( 0 \right) - {\phi _1}\gamma \left( 1 \right) - \cdots - {\phi _p}\gamma \left( p \right) \end{aligned}$$

Together, these equations are known as the Yule-Walker equations.

Yule-Walker

Definition

Equation form:

$$\begin{aligned} \gamma \left( h \right) &= {\phi _1}\gamma \left( {h - 1} \right) + \cdots + {\phi _p}\gamma \left( {h - p} \right), \quad h = 1, \ldots, p \\ {\sigma ^2} &= \gamma \left( 0 \right) - {\phi _1}\gamma \left( 1 \right) - \cdots - {\phi _p}\gamma \left( p \right) \end{aligned}$$

Matrix form: $$\begin{aligned} \Gamma \vec \phi &= \vec \gamma \\ {\sigma ^2} &= \gamma \left( 0 \right) - {{\vec \phi }^T}\vec \gamma \end{aligned}$$ where $$\vec \phi = \left[ {\begin{array}{c} {\phi_1} \\ \vdots \\ {\phi _p} \end{array}} \right]_{p \times 1},\quad \vec \gamma = \left[ {\begin{array}{c} {\gamma \left( 1 \right)} \\ \vdots \\ {\gamma \left( p \right)} \end{array}} \right]_{p \times 1},\quad \Gamma = \left\{ {\gamma \left( {k - j} \right)} \right\}_{j,k = 1}^p$$

More aptly, the structure of $\Gamma$ looks like the following: $$\Gamma = {\left[ {\begin{array}{ccccc} {\gamma \left( 0 \right)}&{\gamma \left( { - 1} \right)}&{\gamma \left( { - 2} \right)}& \cdots &{\gamma \left( {1 - p} \right)} \\ {\gamma \left( 1 \right)}&{\gamma \left( 0 \right)}&{\gamma \left( { - 1} \right)}& \cdots &{\gamma \left( {2 - p} \right)} \\ {\gamma \left( 2 \right)}&{\gamma \left( 1 \right)}&{\gamma \left( 0 \right)}& \cdots &{\gamma \left( {3 - p} \right)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ {\gamma \left( {p - 1} \right)}&{\gamma \left( {p - 2} \right)}&{\gamma \left( {p - 3} \right)}& \cdots &{\gamma \left( 0 \right)} \end{array}} \right]_{p \times p}}$$

Note that we are able to use the above equations to effectively estimate $\vec \phi$ and $\sigma ^2$.

$$\left. \begin{aligned} \hat{\vec{\phi}} &= {{\hat \Gamma }^{ - 1}}\hat{\vec{\gamma}} \\ {{\hat \sigma }^2} &= \hat \gamma \left( 0 \right) - {{\hat{\vec{\gamma}}}^T}{{\hat \Gamma }^{ - 1}}\hat{\vec{\gamma}} \end{aligned} \right\} \to {\text{Yule-Walker estimates}}$$

For the second equation, we are effectively substituting in the first equation for $\hat{\vec{\phi}}$, hence the quadratic form ${{\hat{\vec{\gamma}}}^T}{{\hat \Gamma }^{ - 1}}\hat{\vec{\gamma}}$.
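The Yule-Walker estimates can be computed either by hand from the sample autocovariances or with stats::ar.yw(); a minimal sketch for an AR(2) on simulated data (the true coefficients are arbitrary, and ar.yw() applies a small finite-sample adjustment to the innovation variance, so the two sets of numbers agree only approximately):

```r
# Yule-Walker estimation of an AR(2)
set.seed(7)
xt <- arima.sim(model = list(ar = c(1.3, -0.4)), n = 1000)

# By hand: solve Gamma %*% phi = gamma
g        <- acf(xt, type = "covariance", lag.max = 2, plot = FALSE)$acf[, 1, 1]
Gamma    <- toeplitz(g[1:2])                  # gamma(0) on the diagonal, gamma(1) off it
gamma_p  <- g[2:3]                            # gamma(1), gamma(2)
phi_hat  <- solve(Gamma, gamma_p)
sig2_hat <- g[1] - sum(phi_hat * gamma_p)

# Built-in version
fit_yw <- ar.yw(xt, order.max = 2, aic = FALSE)
list(by_hand = c(phi_hat, sigma2 = sig2_hat),
     ar_yw   = c(fit_yw$ar, sigma2 = fit_yw$var.pred))
```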

With this being said, there are a few nice asymptotic properties that we obtain for an $AR(p)$.

  1. $\sqrt T \left( {\hat{\vec{\phi}} - \vec \phi } \right)\mathop \to \limits_{T \to \infty }^L N\left( {\vec 0,{\sigma ^2}{\Gamma ^{ - 1}}} \right)$
  2. ${\hat \sigma ^2}\mathop \to \limits^p {\sigma ^2}$

Yule-Walker estimates are optimal in the sense that they have the smallest asymptotic variance, i.e. $$Var\left( {\sqrt{T} \hat{\vec{\phi}} } \right) = {\sigma ^2}{\Gamma ^{ - 1}}.$$ However, they are not necessarily optimal with small sample sizes.

Conceptually, the reason for this optimality result is that, for AR models, the moment equations are linear in the parameters.

This is not true for MA or ARMA models, whose moment equations are nonlinear in the parameters and for which method of moments estimators are suboptimal.

Method of Moments - MA(1)

Consider $x_t$ as an $MA(1)$ process: ${x_t} = \theta {w_{t - 1}} + {w_t},{w_t}\mathop \sim \limits^{i.i.d} N\left( {0,{\sigma ^2}} \right)$

Finding the covariance when $h = 1$ gives: $$\begin{aligned} Cov\left( {{x_t},{x_{t - 1}}} \right) &= Cov\left( {\theta {w_{t - 1}} + {w_t},\theta {w_{t - 2}} + {w_{t - 1}}} \right) \\ &= Cov\left( {\theta {w_{t - 1}},{w_{t - 1}}} \right) \\ &= \theta {\sigma ^2} \end{aligned}$$

Finding the variance (e.g. $h=0$) gives: $$\begin{aligned} Cov\left( {{x_t},{x_t}} \right) &= Cov\left( {\theta {w_{t - 1}} + {w_t},\theta {w_{t - 1}} + {w_t}} \right) \\ &= {\theta ^2}Cov\left( {{w_{t - 1}},{w_{t - 1}}} \right) + \underbrace {2\theta Cov\left( {{w_{t - 1}},{w_t}} \right)}_{ = 0} + Cov\left( {{w_t},{w_t}} \right) \\ &= {\theta ^2}{\sigma ^2} + {\sigma ^2} \\ &= {\sigma ^2}\left( {1 + {\theta ^2}} \right) \end{aligned}$$

This gives us the MA(1) ACF of: $$\rho \left( h \right) = \begin{cases} 1 & h = 0 \\ \frac{\theta }{{{\theta ^2} + 1}} & h = \pm 1 \end{cases}$$

With this in mind, let's solve for possible $\theta$ values: $$\begin{aligned} \rho \left( 1 \right) &= \frac{\theta }{{{\theta ^2} + 1}} \\ \Rightarrow \theta &= \left( {{\theta ^2} + 1} \right)\rho \left( 1 \right) \\ \theta &= \rho \left( 1 \right){\theta ^2} + \rho \left( 1 \right) \\ 0 &= \rho \left( 1 \right){\theta ^2} - \theta + \rho \left( 1 \right) \end{aligned}$$

Yuck, that looks nasty. Let's dig out an ol' friend from middle school known as the quadratic formula:

$$\theta = \frac{{ - b \pm \sqrt {{b^2} - 4ac} }}{{2a}}$$

Applying the quadratic formula leads to:

$$\begin{aligned} a &= \rho \left( 1 \right), \quad b = -1, \quad c = \rho \left( 1 \right) \\ \theta &= \frac{{1 \pm \sqrt {{1^2} - 4\rho \left( 1 \right)\rho \left( 1 \right)} }}{{2\rho \left( 1 \right)}} = \frac{{1 \pm \sqrt {1 - 4{{\left[ {\rho \left( 1 \right)} \right]}^2}} }}{{2\rho \left( 1 \right)}} \end{aligned}$$

Thus, we have two possibilities: $$\begin{aligned} {\theta _1} &= \frac{{1 + \sqrt {1 - 4{{\left[ {\rho \left( 1 \right)} \right]}^2}} }}{{2\rho \left( 1 \right)}} \\ {\theta _2} &= \frac{{1 - \sqrt {1 - 4{{\left[ {\rho \left( 1 \right)} \right]}^2}} }}{{2\rho \left( 1 \right)}} \end{aligned}$$

For real solutions to exist we require $\left| {\rho \left( 1 \right)} \right| \leq \frac{1}{2}$, and since ${\theta _1}{\theta _2} = 1$, only one of the two solutions is invertible (i.e. has $|\theta| < 1$), namely ${\theta _2}$. Thus, we opt for ${\theta _2}$.

So, our estimator is: $$\hat \theta = \frac{{1 - \sqrt {1 - 4{{\left[ {\hat \rho \left( 1 \right)} \right]}^2}} }}{{2\hat \rho \left( 1 \right)}}$$

Furthermore, it can be shown that:

$$\sqrt T \left( {\hat \theta - \theta } \right)\mathop \to \limits_{T \to \infty }^L N\left( {0 ,\frac{{1 + {\theta ^2} + 4{\theta ^4} + {\theta ^6} + {\theta ^8}}}{{{{\left( {1 - {\theta ^2}} \right)}^2}}}} \right)$$

So, this is not a particularly efficient estimator...
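A small sketch of this method-of-moments estimator on simulated MA(1) data (the true value θ = 0.5 is an arbitrary choice):

```r
# Method-of-moments estimate of theta for an MA(1)
set.seed(1)
xt    <- arima.sim(model = list(ma = 0.5), n = 2000)
rho1  <- acf(xt, plot = FALSE)$acf[2]                # rho_hat(1)
theta <- (1 - sqrt(1 - 4 * rho1^2)) / (2 * rho1)     # invertible root of the quadratic
theta                                                # should be close to 0.5
```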

Prediction (Forecast)

One of the most interesting things in time series analysis is predicting future unobserved values based on the values observed so far. However, this is not possible if the underlying model is unknown, so in this section we assume the time series $X_t$ is stationary and its model is known. Recall that the best linear predictor was mentioned when we calculated the PACF of an AR model. Here is the definition of the best linear predictor.

Definition Given a stationary time series $X_1, X_2, \dots, X_t$, which without loss of generality we assume has zero mean, the best linear predictor of $X_{t+h}$ is the linear combination $\sum_{j=1}^t \alpha_j X_j$ that minimizes $\mathbb{E}|X_{t+h} - Y|^2$ over all linear combinations $Y$ of $X_1, X_2, \dots, X_t$. The minimal value $\mathbb{E}|X_{t+h} - \sum_{j=1}^t \alpha_j X_j|^2$ is called the mean square prediction error.

Theorem (Projection Theorem) Let $\mathcal{M} \subset \mathcal{L}_2$ be a closed linear subspace of a Hilbert space. For every $y \in \mathcal{L}_2$, there exists a unique element $\hat{y} \in \mathcal{M}$ that minimizes $||y - l||^2$ over $l \in \mathcal{M}$. This element is uniquely determined by the requirements (1) $\hat{y} \in \mathcal{M}$ and (2) $(y - \hat{y}) \perp \mathcal{M}$.

The projection theorem naturally leads to an equivalent way to find the best linear predictor, namely by solving the prediction equations $$\mathbb{E}(X_{t+h} - \hat{X}_{t+h}) = 0 \mbox{ (trivial for a time series with zero mean)}$$ and $$\mathbb{E} [(X_{t+h} - \hat{X}_{t+h})X_j ] = 0, \mbox{ for } j = 1, \dots, t.$$

Writing $\mathbb{E}(X_{i} X_{j})$ as $\gamma(|i - j|)$, the prediction equations can be represented in the following form \begin{equation} \begin{pmatrix} \gamma(0) & \gamma(1) & \cdots & \gamma(t-1) \\ \gamma(1) & \gamma(0) & \cdots & \gamma(t-2) \\ \vdots & \vdots & \ddots & \vdots \\ \gamma(t-1) & \gamma(t-2) & \cdots &\gamma(0) \end{pmatrix} \begin{pmatrix} \alpha_1 \\ \vdots \\ \alpha_t \end{pmatrix} = \begin{pmatrix} \gamma(t+h-1) \\ \vdots \\ \gamma(h) \end{pmatrix}. \end{equation}
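The prediction equations can be solved directly as a Toeplitz system; a small sketch for one-step prediction ($h = 1$) of a causal AR(2), where the theoretical autocorrelations come from ARMAacf() (using $\rho$ instead of $\gamma$ is fine since the common factor $\gamma(0)$ cancels):

```r
# One-step-ahead best linear predictor weights for an AR(2), based on t = 10 observations
t_obs <- 10
gam   <- ARMAacf(ar = c(1.3, -0.4), lag.max = t_obs)   # rho(0), ..., rho(t_obs)
Gamma <- toeplitz(gam[1:t_obs])                        # matrix {gamma(|i - j|)}, up to scale
rhs   <- rev(gam[2:(t_obs + 1)])                       # (gamma(t+h-1), ..., gamma(h)) with h = 1
alpha <- solve(Gamma, rhs)
round(alpha, 4)   # essentially (0, ..., 0, -0.4, 1.3): only the last two observations matter
```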


